The Computational Complexity of Counting Linear Regions in ReLU Neural Networks

Neural Information Processing Systems

An established measure of the expressive power of a given ReLU neural network is the number of linear regions into which it partitions the input space. There exist many different, non-equivalent definitions of what a linear region actually is. We systematically assess which papers use which definitions and discuss how they relate to each other. We then analyze the computational complexity of counting the number of such regions for the various definitions. Generally, this turns out to be an intractable problem. We prove NP- and #P-hardness results already for networks with one hidden layer and strong hardness of approximation results for two or more hidden layers. Finally, on the algorithmic side, we demonstrate that counting linear regions can at least be achieved in polynomial space for some common definitions.
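
To make the point about non-equivalent definitions concrete, here is a minimal sketch for a one-hidden-layer ReLU network with a single scalar input and integer weights. It is a toy illustration, not a construction from the paper: the function names and the example network are assumptions chosen for this sketch. It counts (a) regions with distinct activation patterns and (b) maximal intervals on which the network is affine, and the two counts can differ when kinks cancel.

```python
from fractions import Fraction

def breakpoints(neurons):
    """Sorted distinct kink locations of x -> sum(a * relu(w*x + b)).

    `neurons` is a list of integer triples (w, b, a). Units with w == 0
    are constant in x and contribute no kink.
    """
    return sorted({Fraction(-b, w) for w, b, a in neurons if w != 0})

def count_activation_regions(neurons):
    """Number of intervals on which the on/off pattern of all units is fixed."""
    return len(breakpoints(neurons)) + 1

def count_affine_pieces(neurons):
    """Number of maximal intervals on which the network output is affine."""
    pieces = 1
    for t in breakpoints(neurons):
        # Each unit with its kink at t changes the slope there by a * |w|.
        jump = sum(a * abs(w) for w, b, a in neurons
                   if w != 0 and Fraction(-b, w) == t)
        if jump != 0:  # slopes differ across t, so a new affine piece starts
            pieces += 1
    return pieces

# Hypothetical example: the two units with a kink at x = 1 cancel each other.
toy = [(1, 0, 1), (1, -1, 2), (2, -2, -1)]
print(count_activation_regions(toy), count_affine_pieces(toy))  # prints "3 2"
```

In one dimension both counts are easy to compute; the hardness results in the abstract concern higher-dimensional inputs, where no such simple sweep over breakpoints is available.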


Adaptive Algorithms with Sharp Convergence Rates for Stochastic Hierarchical Optimization

Neural Information Processing Systems

Hierarchical optimization refers to problems with interdependent decision variables and objectives, such as minimax and bilevel formulations. While various algorithms have been proposed, existing methods and analyses lack adaptivity in stochastic optimization settings: they cannot achieve optimal convergence rates across a wide spectrum of gradient noise levels without prior knowledge of the noise magnitude. In this paper, we propose novel adaptive algorithms for two important classes of stochastic hierarchical optimization problems: nonconvex-strongly-concave minimax optimization and nonconvex-strongly-convex bilevel optimization. Our algorithms achieve sharp convergence rates of $\tilde{O}(1/\sqrt{T} + \sqrt{\sigma}/T^{1/4})$ in $T$ iterations for the gradient norm, where $\sigma$ is an upper bound on the stochastic gradient noise. Notably, these rates are obtained without prior knowledge of the noise level, thereby enabling automatic adaptivity in both low- and high-noise regimes. To our knowledge, this work provides the first adaptive and sharp convergence guarantees for stochastic hierarchical optimization. Our algorithm design combines the momentum normalization technique with novel adaptive parameter choices. Extensive experiments on synthetic and deep learning tasks demonstrate the effectiveness of our proposed algorithms.
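
The momentum normalization building block mentioned in the abstract can be sketched as follows. This is only an illustration on plain single-level minimization with fixed, hand-picked parameters (`lr`, `beta`); it does not reproduce the paper's minimax or bilevel algorithms or their adaptive parameter choices, and all names here are assumptions made for the sketch.

```python
import math

def normalized_momentum_step(x, m, g, lr, beta):
    """One update: average the gradient into the momentum, then move a
    fixed distance lr along the normalized momentum direction."""
    m_new = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]
    norm = math.sqrt(sum(mi * mi for mi in m_new))
    if norm == 0.0:
        return list(x), m_new  # no direction to move in
    x_new = [xi - lr * mi / norm for xi, mi in zip(x, m_new)]
    return x_new, m_new

def minimize(grad, x0, steps=200, lr=0.05, beta=0.9):
    """Run the update with a constant step size (an assumption of this sketch)."""
    x, m = list(x0), [0.0] * len(x0)
    for _ in range(steps):
        x, m = normalized_momentum_step(x, m, grad(x), lr, beta)
    return x

# Toy objective f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x_final = minimize(lambda x: list(x), [3.0, -4.0])
print(math.hypot(*x_final))  # distance to the minimizer after 200 steps
```

Because the step length is always `lr` regardless of the gradient scale, the update is insensitive to the magnitude of the gradient noise, which is one reason normalization is a natural starting point for noise-adaptive methods. With a constant `lr` the iterates only reach a neighborhood of the minimizer.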


Learning from A Single Markovian Trajectory: Optimality and Variance Reduction

Neural Information Processing Systems

In this paper, we consider the general stochastic non-convex optimization problem when the sampling process follows a Markov chain. This problem captures many real-world applications, ranging from asynchronous distributed learning to reinforcement learning. In particular, we consider the worst case where one has no prior knowledge and control of the Markov chain, meaning multiple trajectories cannot be simulated but only a single trajectory is available for algorithm design. We first provide algorithm-independent lower bounds with $\Omega(\epsilon^{-3})$ (and $\Omega(\epsilon^{-4})$) samples, when objectives are (mean-squared) smooth, for any first-order methods accessing bounded variance gradient oracles to achieve $\epsilon$-approximate critical solutions of original problems. Then, we propose Markov-Chain SPIDER (MaC-SPIDER), which leverages variance-reduced techniques, to achieve an $O(\epsilon^{-3})$ upper bound for mean-squared smooth objective functions. To the best of our knowledge, MaC-SPIDER is the first to achieve $O(\epsilon^{-3})$ complexity when sampling from a single Markovian trajectory. Our lower bound establishes its (near) optimality.


Online Two-Stage Submodular Maximization

Neural Information Processing Systems

Given a collection of monotone submodular functions, the goal of Two-Stage Submodular Maximization (2SSM) [Balkanski et al., 2016] is to restrict the ground set so an objective selected u.a.r.


Provably Data-driven Lagrangian Relaxation for Mixed Integer Linear Programming

arXiv.org Machine Learning

Lagrangian Relaxation (LR) is a powerful technique for solving large-scale Mixed Integer Linear Programming (MILP) problems, particularly those with decomposable structures, such as vehicle routing or unit commitment problems. By relaxing the coupling constraints, LR enables parallel subproblem solving and often yields tighter dual bounds than standard linear programming relaxations, which is crucial for efficient branch-and-bound pruning. While recent empirical work has shown promising results using machine learning to predict the Lagrangian multipliers, a theoretical understanding of such methods remains an open question. In this work, we bridge this gap by analyzing the problem of learning LR through the lens of Data-driven Algorithm Design, i.e., a statistical learning problem over a distribution of problem instances. Our contributions are as follows: first, we derive a generalization bound of $\mathcal{O}(s^{1.5}/\sqrt{N})$ for the learned multipliers, where $s$ is the number of coupling constraints and $N$ is the sample size. Second, we provide a minimax lower bound of $\Omega(s/\sqrt{N})$, proving that a linear dependency is unavoidable. Third, we constructively close this theoretical gap by proving that Stochastic Gradient Ascent (SGA) with averaging achieves the minimax optimal rate $\Theta(s/\sqrt{N})$. Finally, we extend our framework to the learning-to-warm-start setting, proving that it achieves a fast, minimax-optimal rate of $\Theta(s/N)$ and establishing a theoretical advantage over direct multiplier prediction.
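
As a reminder of what the multipliers do, the sketch below relaxes a single coupling constraint in a tiny binary program and maximizes the resulting dual bound by projected subgradient ascent. It is a single-instance, deterministic toy and not the paper's learning setup (no distribution over instances, no averaging); the problem data, function names, and the $1/(t+1)$ step size are assumptions made for illustration.

```python
def lagrangian_dual(costs, weights, demand, lam):
    """Dual value and a subgradient at multiplier lam >= 0 for
        min sum(c_i * x_i)  s.t.  sum(a_i * x_i) >= demand,  x_i in {0, 1}.
    Relaxing the coupling constraint makes the problem separable in i."""
    # Each x_i is chosen independently: switch it on iff its reduced cost is negative.
    x = [1 if c - lam * a < 0 else 0 for c, a in zip(costs, weights)]
    value = lam * demand + sum((c - lam * a) * xi
                               for c, a, xi in zip(costs, weights, x))
    subgrad = demand - sum(a * xi for a, xi in zip(weights, x))
    return value, subgrad

def subgradient_ascent(costs, weights, demand, steps=50):
    """Maximize the dual bound over lam >= 0; returns the best bound seen."""
    lam, best = 0.0, float("-inf")
    for t in range(steps):
        value, g = lagrangian_dual(costs, weights, demand, lam)
        best = max(best, value)
        lam = max(0.0, lam + g / (t + 1))  # projected step, diminishing step size
    return best

# Hypothetical instance: pick at least 2 of 3 items at minimum cost (optimum is 3).
bound = subgradient_ascent(costs=[3, 1, 2], weights=[1, 1, 1], demand=2)
print(bound)
```

By weak duality every value returned by `lagrangian_dual` is a lower bound on the optimal cost, so a good multiplier directly translates into stronger pruning; predicting such multipliers from data is the setting the abstract analyzes.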


Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability

arXiv.org Machine Learning

\emph{Kullback-Leibler} (KL) regularization is ubiquitous in reinforcement learning algorithms in the form of \emph{reverse} or \emph{forward} KL. Recent studies have demonstrated $\epsilon^{-1}$-type fast rates for decision making under reverse KL regularization, in contrast to the standard $\epsilon^{-2}$-type sample complexity. However, for forward-KL-regularized objectives, existing statistical analyses are either not applicable or result in $\tilde{O}(\epsilon^{-2})$ slow rates. We take the first step towards addressing this problem via a streamlined analysis of forward-KL-regularized offline CBs. We give the first $\tilde{O}(\epsilon^{-1})$ upper bounds in tabular and general function approximation settings, both under notions of \emph{single-policy concentrability}. In particular, our convex-analytical pipeline unifies these settings by exploiting the pessimism principle in a novel way and completely bypasses the proof routines in previous works based on the mean value theorem, which might be of independent interest. Moreover, we provide rate-optimal lower bounds, manifesting the tightness of our upper bounds in terms of statistical rates. Our lower bounds also demonstrate that the forward-KL-regularized sample complexity recovers the unregularized slow rate in the low-regularization regime, similarly to the reverse-KL regularization.


The Causal Description Gap: Information-Theoretic Separations Across Pearl's Hierarchy

arXiv.org Machine Learning

Pearl's causal hierarchy shows that observational, interventional, and counterfactual queries are qualitatively distinct. We ask a quantitative version of this question: how many additional bits are needed to specify higher-rung causal answers once lower-rung answers are known? We formalize this via query-class description length, the Kolmogorov complexity of the answer oracle induced by an SCM for a class of queries. Our main construction gives binary acyclic SCMs whose observational distribution has constant description length, while the single-variable interventional answer oracle has description length $\Theta(n^2)$. A degree-sensitive upper bound shows that finite-gate-schema SCMs of indegree $d$ have observational-interventional gap at most $O(nd \log(en/d) + n \log n)$, making the quadratic construction order-optimal in the dense regime and a rooted-tree construction order-optimal for bounded indegree. The quadratic separation persists under $\varepsilon$-accurate total-variation descriptions for every fixed $\varepsilon < 1/4$. At the next rung, the full hard-do interventional oracle can still leave a $\Theta(n)$ counterfactual description gap. A general ambiguity-to-bits theorem and Shannon analogue show that these gaps equal the logarithm of residual higher-rung ambiguity up to lower-order terms.




Double Randomized Underdamped Langevin with Dimension-Independent Convergence Guarantee

Neural Information Processing Systems

This paper focuses on the high-dimensional sampling of log-concave distributions with composite structures: $p^*(\mathrm{d}x) \propto \exp(-g(x) - f(x))\,\mathrm{d}x$. We develop a double randomization technique, which leads to a fast underdamped Langevin algorithm with a dimension-independent convergence guarantee.